Systems biology approaches to identify driver genes and drug combinations for treating COVID-19

Corona virus 19 (Covid-19) has caused many problems in public health, economic, and even cultural and social fields since the beginning of the epidemic. However, in order to provide therapeutic solutions, many researches have been conducted and various omics data have been published. But there is still no early diagnosis method and comprehensive treatment solution. In this manuscript, by collecting important genes related to COVID-19 and using centrality and controllability analysis in PPI networks and signaling pathways related to the disease; hub and driver genes have been identified in the formation and progression of the disease. Next, by analyzing the expression data, the obtained genes have been evaluated. The results show that in addition to the significant difference in the expression of most of these genes, their expression correlation pattern is also different in the two groups of COVID-19 and control. Finally, based on the drug-gene interaction, drugs affecting the identified genes are presented in the form of a bipartite graph, which can be used as the potential drug combinations.

various analyzes of the obtained bioinformatic networks, including checking the controllability and identifying driver vertices in order to obtain a collection of important and effective genes in COVID-19 [34][35][36][37] .
In addition to network analysis of the omics data and identification of the effective and driver vertices to provide therapeutic solutions related to COVID-19, several studies have been conducted based on machine learning methods to identify signatures and the rules between them to diagnose COVID-19.For example, in an article, multiple machine learning methods on transcriptomics data of upper airway tissue with acute respiratory illnesses led to the identification of effective qualitative biomarkers and quantitative rules for distinguishing COVID-19 patients from other infectious 38 .In another work, the methylation data is first analyzed using the Monte Carlo feature selection method to obtain a feature list, and a decision tree is used to identify methylation features and decision rules that clearly distinguish different cases 39 .In an article, key methylation sites that have distinct patterns among patients with COVID-19 at different ages, have been identified 40 .Also, in another work, the biomarkers related to COVID-19 have been classified according to the severity of the disease 41 , microRNA, methylation signatures, and the rules between them that determine the severity of COVID-19 in different patients have been identified based on machine learning methods 42,43 .
In this manuscript; first, with the aim of considering the results of various researches conducted in order to obtain genes related to COVID-19, genes that have been reported with a valid p Value or with a large number of references in different articles, Collected.Then, by forming a PPI network between related proteins, centrality analysis has been performed and proteins that have many connections have been identified.Next, by targeting these proteins, in the directed network of signaling pathways related to COVID-19, the vertices that have the highest control power have been determined.Then, with correlation and DEG analyzes in the expression data, it has been shown that the identified hub and driver genes mainly in the two groups of COVID-19 and control have significant expression differences and secondly, their expression correlation patterns change.In the end, by using the connections between the existing drugs and the set of genes obtained, the bipartite drug-gene network, the drugs affecting each of the obtained genes have been determined, which could be potential combinations of existing drugs in order to repurpose drugs to treat COVID-19.

Collection of genes related to COVID-19 based on literature
In the first stage, the set of genes related to corona disease have been collected based on the literature.For this purpose, two databases; the CORMINE Medical online and DisGeNET have been used.In the CORMINE database, the set of related genes is ranked based on the p Value criteria.564 genes with p Value less than 0.05 were selected (Table S1).The DisGeNET database has ranked the set of related genes based on the number of references.Genes with more than 5 references have been selected, which is equal to 296 genes (Table S2).Due to the commonality of some genes, 757 genes related to COVID-19 that have many references or small p Value were collected from the above two sets, which were used for further analysis.

PPI network construction and analysis
In the following, among the collected genes related to COVID-19, genes that have many connections and effects on other genes have been identified.For this purpose, the PPI network between the corresponding proteins is formed based on the STRING database in which, functional and structural relationships is considered (Fig. 1a).The 10 proteins with the highest degree in the constructed network are indicated in the figure along with the number of connections.Also, the set of all proteins with a degree greater than 50 have been identified.
Next, it has been investigated that the identified hub proteins necessarily interact with the COVID-19 proteins.For this purpose, the human genes associated with COVID-19 which were obtained based on the virus-human protein interaction network and using the gene ontology, have been used 44 .Out of 10 identified hub proteins, 9 proteins have interactions which are considered for further analysis.(ACTB protein has no interaction) Also, out of 310 proteins with a degree higher than 50, 239 proteins have interaction and 71 proteins have no interaction, which are categorized into two groups in Table S3.A set of 239 proteins with a degree of more than 50 and having interaction, are considered control targets in signaling pathways related to COVID-19.

Controllability of signaling pathways related to COVID-19
Because the directional network related to the signaling pathways of COVID-19 includes the effect and effectiveness relationships between the genes in the network and specifies functional relationships that different genes have on each other during the progression of the disease.To analyze the controllability, the signaling pathways in the KEGG database 45 have been used.At this stage, the directed network related to the signaling pathways of COVID-19 has been extracted.(Coronavirus disease-COVID-19-Homo sapiens) (Fig. 1b) Next, common vertices are obtained in the directed network, and the set of proteins with a degree higher than 50 and interacting with COVID-19 proteins of the previous step is considered as the control target.(Vertices that marked with green color) and then based on the target controllability algorithm with the least mediator vertices 33 , 10 vertices that have the most power to control the target set have been determined (the set of driver vertices is marked in the figure).
Because each of the obtained vertices, according to the definition of the vertices with the highest control power, must control a large number of the set of target vertices, and on the other hand, each of the vertices of the target set has many connections in the PPI network, so the obtained driver vertices should affect the set of many proteins in the PPI network between the proteins related to COVID-19.
IL6 protein is among the top 9 proteins in the PPI network and is also among the top 10 drivers in signaling pathways.Therefore, a total of 18 proteins have been considered as a set of hub and driver proteins for further analysis.Driver and hub genes identified, have subscription with the results of machine learning methods to identify signatures associated with COVID-19.For example, the TNF gene is one of the essential biomarkers of COVID-19 38 , or TNF and INS genes have been obtained as transcriptional biomarkers for severe COVID-19 by machine learning methods 41 .

Analysis of expression data
In the following, to evaluate the obtained results, analysis of expressive data has been done.For this purpose, expression data set GSE163151 was used, in which 404 expression profiles were collected from nasal swabs and blood of healthy people and people suffering from COVID-19 and other viral and bacterial infections 46 .Of the total data, 31 data profiles are from healthy people (control group) and 145 data profiles are from people with COVID-19.First, DEG analysis has been performed between two control and COVID-19 groups.Figure 2a,b show the Principal component analysis in two groups and the heatmap of 50 genes with the highest expression changes 15 .Then the co-expression analysis between the hub and driver genes obtained in the previous steps has been performed.Figures 2c,d show the level of expression correlation in the COVID-19 group and the control group, respectively.The results show the correlation changes in gene expression in the two groups.In general, the correlation is higher in the COVID-19 group.
Of the 18 selected hub and driver genes, 13 genes with a p Value less than 0.05 have expression differences in the two groups of COVID-19 and control, as shown in Table 1.Also, AveExpr and Log2FC of each gene are expressed in the second and third columns respectively.

Identification of drugs related to the obtained gene set
In order to identify potential drug combinations against COVID-19, existing drugs that affect each of the identified hub and driver genes based on DrugCentral 47 and KEGG pathways were presented as potential drug targets.The related genes and drugs are indicated in the bipartite graph format shown in Fig. 3.In the following, Similar to the process used to find candidate drugs for lung cancer 48 , based on the STITCH database, which identifies chemical-chemical and chemical-proteins interaction, five chemical compounds corresponding to each of the 18 driver and hub genes have been determined, which are presented in figures S1 and S2.
There is considerable evidence of relationships between the identified drugs and Remdesivir, which is a wellknown and widely used antiviral medicine that works by stopping the virus that causes COVID-19.including: The metabolism of Remdesivir can be decreased when combined with Midostaurin 49 .The metabolism of Bortezomib can be decreased when combined with Remdesivir 49 .The metabolism of Digitoxin can be decreased when combined with Remdesivir 49 .The excretion of Raloxifene can be decreased when combined with Remdesivir 50 .The metabolism of Sunitinib can be decreased when combined with Remdesivir 49 .The metabolism of Ceritinib can be decreased when combined with Remdesivir 49 .The metabolism of Crizotinib can be decreased when combined with Remdesivir 49 .The metabolism of Ruxolitinib can be decreased when combined with Remdesivir 49 .The metabolism of Tofacitinib can be decreased when combined with Remdesivir 51 .The serum concentration of Remdesivir can be increased when it is combined with Fedratinib 52 .The metabolism of Nintedanib can be decreased when combined with Remdesivir 49 .Details of each of the 61 drug-gene interactions identified in   Fig. 3 between 14 genes and 48 drugs are specified in Table S4.Where the signaling pathways and mechanism of drug action on the relevant gene, structure id, target class, accession, action value, and action type, are specified.

Conclusion
Due to the importance of finding a treatment solution for COVID-19, various omics data related to it have been created.Also, many bioinformatics researches have been conducted, which depending on the data selection and analysis method, have produced sometimes different results.Therefore, in the direction of a comprehensive analysis, it is necessary to consider all the presented results.In this paper, to create a PPI network related to COVID-19, using the results presented in different articles, a set of genes has been selected that have either a large number of references confirming its importance with COVID-19 or based on the criterion p Value is very important in the development of COVID-19.In the PPI network, which includes the functional and physical connections of the selected proteins, the vertices with the highest degree are selected.Therefore, the change in the concentration of obtained proteins will affect a large number of proteins related to COVID-19.Also, the controllability analysis of the directed network related to the signaling pathways of COVID-19 has led to the identification of driver vertices.Network control methods look for vertices in the network that have a stimulating role and when they are affected by external signals such as drugs or any other treatment method, they can hierarchically affect all network vertices and create the desired change.The above analyzes were not performed independently.Rather, the control target vertices are proteins that have many connections in the PPI network.Since the high control power is due to the presence of long paths and the high degree is due to many connections, the high control power of the vertex and the high degree of the vertex in the network are complementary to each other.Therefore, when the controlled vertices have a high degree, the influence of the stimulating vertices in the network will be doubled.
In order to evaluate and check the significance of the obtained results, expression data analysis has been done.The results confirm that the expression of the set of genes obtained is mainly different in the two groups of COVID-19 and control, and on the other hand, the correlation of the expression of pairs of genes is different in the two groups.Finally, the most important drugs that affect each of the obtained genes are shown in the form of a bipartite network, which can potentially provide new drug suggestions for the treatment of COVID-19.
The process carried out in this paper to provide drug combinations for COVID-19, Collects and filters diseaserelated genes based on literature.Then it performs centrality and controllability analysis on different PPI networks and the related signaling pathways to identify hub and driver genes.In the following based on the analysis of expression data, the obtained results are evaluated.Finally, based on various drug-gene relationships, appropriate drug combinations are suggested, which can be a general pipeline that can be considered for any other disease.

Hub vertices selection
Determination of hub vertices in the PPI network is based on the degree centrality criterion.The 10 proteins with the highest degree have been selected as the 10 hub proteins.www.nature.com/scientificreports/

Controllability of complex networks
A dynamical system is controllable if, by entering appropriate input signals, the state of vertices can be transferred from any position to any desired position with a finite number of steps.Checking the exact controllability is based on the Kalman's controllability rank condition.On the other hand, checking the structural controllability is based on the minimum input theorem, which determines the relationship between maximum matching and driver vertices based on the Lin's structural controllability theorem.

Controllability with minimum mediator
The target controllability algorithm with minimum mediator, which is based on the length of the paths between vertices in the network, identifies driver vertices that can control the desired target with the least number of intermediary vertices.

Control centrality
As a centrality measure, it seeks to identify the vertices that control the largest number of vertices in the whole network or desired target.10 vertices with the highest control power in the directed network related to the KEGG pathway signaling have been identified.

Figure 1 .
Figure 1.Bioinformatic networks related to COVID-19.(a) Undirected PPI network obtained from the STRING database between the identified proteins related to COVID-19 based on the literature.The 10 proteins with the highest degree are identified.ACTB protein has no interaction with COVID-19 proteins.(b) Directed network related to the signaling pathways of COVID-19 obtained from the KEGG database.Common vertices with the set of vertices with a degree greater than 50 in the PPI network and interacting with COVID-19 proteins are considered as control targets.(Green vertices).The 10 driver vertices with the highest control power for the considered target are specified in the network (written in a larger format).

Figure 2 .
Figure 2. Expression data analysis results (a) Principal component analysis in two groups of COVID-19 and control.(b) Heatmap of 50 genes with the most expression changes.(c) Expression correlation of hub and driver genes in the group of COVID-19.(d) Expression correlation of hub and driver genes in control group.

Figure 3 .
Figure 3. Bipartite network of connections between genes and drugs.

Table 1 .
The set of hub and driver genes with P Value less than 0.05 in the two groups of COVID-19 and control have expression differences.